mean_tweedie_deviance (Mean Tweedie Deviance)#

mean_tweedie_deviance is a regression metric / loss derived from the Tweedie family of distributions. It is a great fit when targets are non-negative, often skewed, and possibly contain many zeros.

Typical examples:

  • insurance pricing: claim cost per policy (many zeros + positive continuous)

  • rainfall amounts (many zeros + positive continuous)

  • event counts (Poisson special case)


Learning goals#

  • Understand the Tweedie power parameter p and special cases (Normal / Poisson / Gamma)

  • Write the metric with clear math + domain constraints

  • Implement mean_tweedie_deviance from scratch in NumPy (incl. weights + edge cases)

  • Build intuition with plots: asymmetry, effect of p, and sensitivity to extreme errors

  • Use the deviance as a training loss: fit a simple Tweedie GLM with gradient descent

Prerequisites#

  • Basic calculus (derivatives)

  • Familiarity with regression + linear models

  • Helpful: GLMs and link functions (we use a log link)

2.5) Probabilistic view: deviance as (scaled) negative log-likelihood#

For Tweedie GLMs, the deviance comes from the log-likelihood. You can view it as twice the extra negative log-likelihood you pay by predicting \(\hat{\mu}\) instead of the (unrealistic) perfectly-fitting saturated model.

For an exponential dispersion model, the (unit) deviance can be written as:

\[ d_p(y, \hat{\mu}) = 2\,\big( \mathcal{L}(\hat{\mu}) - \mathcal{L}(y) \big) \]

where \(\mathcal{L}(\mu)\) is the part of the negative log-likelihood that depends on \(\mu\). For \(p\notin\{0,1,2\}\) (ignoring constants that don’t depend on \(\mu\)):

\[ \mathcal{L}(\mu) = \frac{\mu^{2-p}}{2-p} - \frac{y\,\mu^{1-p}}{1-p} \]

and the saturated model corresponds to setting \(\mu=y\) for each sample. So minimizing mean Tweedie deviance is (up to a scale) the same as maximum likelihood for the Tweedie family.
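A quick standalone check of this identity (plain NumPy, with an arbitrarily chosen sample): computing \(2(\mathcal{L}(\hat{\mu})-\mathcal{L}(y))\) reproduces the closed-form unit deviance used throughout this notebook.

```python
import numpy as np

def nll_part(mu, y, p):
    # mu-dependent part of the negative log-likelihood (p not in {0, 1, 2})
    return mu**(2 - p) / (2 - p) - y * mu**(1 - p) / (1 - p)

p, y, mu_hat = 1.5, 3.0, 2.0
dev_via_likelihood = 2 * (nll_part(mu_hat, y, p) - nll_part(y, y, p))

# Closed-form unit deviance for comparison
dev_closed = 2 * (y**(2 - p) / ((1 - p) * (2 - p))
                  - y * mu_hat**(1 - p) / (1 - p)
                  + mu_hat**(2 - p) / (2 - p))
print(dev_via_likelihood, dev_closed)  # identical up to rounding
```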

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

1) The Tweedie family (why this metric exists)#

The Tweedie family is an exponential dispersion model with a mean–variance relationship:

\[ \mathrm{Var}(Y\mid \mu) = \phi\,\mu^{p} \]
  • \(\mu = \mathbb{E}[Y]\) is the mean

  • \(\phi>0\) is a dispersion (scale) parameter

  • \(p\) is the power parameter: it determines the distribution type and how variance scales with the mean

In scikit-learn, mean_tweedie_deviance is defined for \(p \le 0\) or \(p \ge 1\). Common special cases:

| power p | distribution (typical) | support |
|---|---|---|
| p = 0 | Normal | real-valued |
| p = 1 | Poisson | counts (0, 1, 2, …) |
| 1 < p < 2 | Compound Poisson–Gamma | mass at 0 + positive continuous |
| p = 2 | Gamma | positive continuous |
| p = 3 | Inverse Gaussian | positive continuous |

The main “wow” case is 1 < p < 2: it models data that is often zero but otherwise continuous and skewed.
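A quick standalone simulation (rate and severity parameters chosen arbitrarily here) shows both features at once: an exact point mass at zero plus a continuous, skewed positive part:

```python
import numpy as np

rng = np.random.default_rng(0)
n_events = rng.poisson(0.7, size=10_000)      # how many events occur
totals = np.where(n_events > 0,
                  rng.gamma(np.maximum(n_events, 1) * 2.0, 0.5),
                  0.0)                        # total size of all events
print(f'zero fraction: {(totals == 0).mean():.2f}')  # P(N=0) = exp(-0.7) ~ 0.50
```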

2) Definition#

Let \(y_i\) be targets and \(\hat{\mu}_i\) be predictions (think: predicted mean). The per-sample Tweedie deviance is:

For \(p \notin \{0,1,2\}\):

\[ d_p(y, \hat{\mu}) = 2\left[ \frac{y^{2-p}}{(1-p)(2-p)} - \frac{y\,\hat{\mu}^{1-p}}{1-p} + \frac{\hat{\mu}^{2-p}}{2-p} \right] \]

Special cases:

  • \(p=0\) (Normal):

    \[d_0(y, \hat{\mu}) = (y-\hat{\mu})^2\]
  • \(p=1\) (Poisson):

    \[d_1(y, \hat{\mu}) = 2\left( y\log\frac{y}{\hat{\mu}} + \hat{\mu}-y \right),\quad 0\log 0 := 0\]
  • \(p=2\) (Gamma):

    \[d_2(y, \hat{\mu}) = 2\left( \log\frac{\hat{\mu}}{y} + \frac{y}{\hat{\mu}} - 1 \right)\]

The scikit-learn metric is the mean deviance:

\[ \mathrm{mean\_tweedie\_deviance}(y, \hat{\mu}) = \frac{1}{n}\sum_{i=1}^n d_p(y_i, \hat{\mu}_i) \]
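As a sanity check (standalone NumPy, with arbitrarily chosen \(y\) and \(\hat{\mu}\)), the general formula plugs in cleanly at \(p=0\), recovering squared error exactly, and approaches the Poisson case as \(p \to 1\):

```python
import numpy as np

def tweedie_dev_general(y, mu, p):
    # General closed form for p not in {0, 1, 2}; p = 0 happens to plug in cleanly
    return 2 * (y**(2 - p) / ((1 - p) * (2 - p))
                - y * mu**(1 - p) / (1 - p)
                + mu**(2 - p) / (2 - p))

y, mu = 3.0, 2.0
print(tweedie_dev_general(y, mu, 0.0), (y - mu) ** 2)      # both 1.0
poisson = 2 * (y * np.log(y / mu) + mu - y)
print(tweedie_dev_general(y, mu, 1.0 + 1e-7), poisson)     # nearly equal
```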

Domain constraints (important)#

scikit-learn enforces these domains:

  • \(p < 0\): \(\hat{\mu} > 0\) (targets may be any real number)

  • \(p = 0\): no constraints (\(y,\hat{\mu}\in\mathbb{R}\))

  • \(1 \le p < 2\): \(y \ge 0\) and \(\hat{\mu} > 0\)

  • \(p \ge 2\): \(y > 0\) and \(\hat{\mu} > 0\)
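These constraints are enforced at call time; violating one raises a ValueError (small demo, values chosen to trip the \(p \ge 2\) rule):

```python
import numpy as np
from sklearn.metrics import mean_tweedie_deviance

y = np.array([0.0, 1.0, 2.0])    # contains an exact zero
mu = np.array([0.5, 1.0, 2.0])

# Zeros in y are fine for 1 <= p < 2 ...
ok = mean_tweedie_deviance(y, mu, power=1.5)

# ... but p >= 2 requires strictly positive targets
err = None
try:
    mean_tweedie_deviance(y, mu, power=2)
except ValueError as exc:
    err = exc
print('power=1.5 ->', round(ok, 4), '| power=2 ->', type(err).__name__)
```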

# Quick demo (using scikit-learn)
y_true = np.array([2.0, 0.0, 1.0, 4.0])
y_pred = np.array([0.5, 0.5, 2.0, 2.0])

print('power=1   (Poisson):', mean_tweedie_deviance(y_true, y_pred, power=1))
print('power=1.5 (Tweedie):', mean_tweedie_deviance(y_true, y_pred, power=1.5))

y_true_pos = np.array([2.0, 1.0, 4.0, 3.0])
y_pred_pos = np.array([1.5, 0.7, 2.5, 3.4])

print('power=0 (MSE):', mean_tweedie_deviance(y_true_pos, y_pred_pos, power=0))
print('power=2 (Gamma):', mean_tweedie_deviance(y_true_pos, y_pred_pos, power=2))
power=1   (Poisson): 1.4260151319598084
power=1.5 (Tweedie): 1.7781745930520232
power=0 (MSE): 0.6875
power=2 (Gamma): 0.12753010019949473

3) NumPy implementation (from scratch)#

We’ll implement two functions:

  • tweedie_deviance_per_sample_np → returns \(d_p(y_i, \hat{\mu}_i)\) for each sample

  • mean_tweedie_deviance_np → averages the per-sample deviances (optionally weighted)

Goal: match scikit-learn behavior, including input validation.

def _check_1d_float(x, name):
    x = np.asarray(x, dtype=float)
    if x.ndim != 1:
        raise ValueError(f'{name} must be 1D, got shape {x.shape!r}')
    return x


def _xlogy(x, y):
    """Compute x * log(y) with the convention 0 * log(y) = 0.

    This avoids the RuntimeWarning you get from np.where(..., 0, x*np.log(y)).
    """

    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    x, y = np.broadcast_arrays(x, y)
    out = np.zeros_like(x, dtype=float)

    mask = x != 0
    out[mask] = x[mask] * np.log(y[mask])
    return out
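For reference, SciPy ships the same convention as scipy.special.xlogy (assuming SciPy is available in your environment); the NumPy-only helper above matches it:

```python
import numpy as np
from scipy.special import xlogy

print(xlogy(0.0, 0.0))                       # 0.0 by convention
print(xlogy(2.0, 3.0), 2.0 * np.log(3.0))   # agree
```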


def tweedie_deviance_per_sample_np(y_true, y_pred, *, power=0.0):
    y_true = _check_1d_float(y_true, 'y_true')
    y_pred = _check_1d_float(y_pred, 'y_pred')
    if y_true.shape[0] != y_pred.shape[0]:
        raise ValueError('y_true and y_pred must have the same length')

    p = float(power)
    if not (p <= 0 or p >= 1):
        raise ValueError('power must be <= 0 or >= 1')

    msg = f'Mean Tweedie deviance error with power={p} can only be used on '
    if p < 0:
        if np.any(y_pred <= 0):
            raise ValueError(msg + 'strictly positive y_pred.')
    elif p == 0:
        pass
    elif 1 <= p < 2:
        if np.any(y_true < 0) or np.any(y_pred <= 0):
            raise ValueError(msg + 'non-negative y and strictly positive y_pred.')
    elif p >= 2:
        if np.any(y_true <= 0) or np.any(y_pred <= 0):
            raise ValueError(msg + 'strictly positive y and y_pred.')

    if p < 0:
        # Matches scikit-learn: use y_true^... only when y_true > 0.
        y_pos = np.where(y_true > 0, y_true, 0.0)
        dev = 2 * (
            np.power(y_pos, 2 - p) / ((1 - p) * (2 - p))
            - y_true * np.power(y_pred, 1 - p) / (1 - p)
            + np.power(y_pred, 2 - p) / (2 - p)
        )
    elif p == 0:
        dev = (y_true - y_pred) ** 2
    elif p == 1:
        # 2 * ( y*log(y/mu) - y + mu )
        dev = 2 * (_xlogy(y_true, y_true / y_pred) - y_true + y_pred)
    elif p == 2:
        dev = 2 * (np.log(y_pred / y_true) + y_true / y_pred - 1)
    else:
        dev = 2 * (
            np.power(y_true, 2 - p) / ((1 - p) * (2 - p))
            - y_true * np.power(y_pred, 1 - p) / (1 - p)
            + np.power(y_pred, 2 - p) / (2 - p)
        )

    return dev


def mean_tweedie_deviance_np(y_true, y_pred, *, power=0.0, sample_weight=None):
    dev = tweedie_deviance_per_sample_np(y_true, y_pred, power=power)
    if sample_weight is None:
        return float(np.mean(dev))

    w = _check_1d_float(sample_weight, 'sample_weight')
    if w.shape[0] != dev.shape[0]:
        raise ValueError('sample_weight must have the same length as y')
    if np.any(w < 0):
        raise ValueError('sample_weight must be non-negative')
    if float(np.sum(w)) == 0.0:
        raise ValueError('sum of sample_weight must be > 0')

    return float(np.sum(w * dev) / np.sum(w))
# Verify that our implementation matches scikit-learn
for p in [0, 1, 1.5, 2, 3, -1]:
    if p == 0:
        y = rng.normal(size=300)
        mu = rng.normal(size=300)
    elif p < 2:
        y = rng.poisson(lam=2.0, size=300).astype(float)
        mu = np.exp(rng.normal(size=300))
    else:
        y = np.exp(rng.normal(size=300))
        mu = np.exp(rng.normal(size=300))

    w = rng.uniform(0.1, 2.0, size=y.shape[0])

    sk = mean_tweedie_deviance(y, mu, power=p)
    npv = mean_tweedie_deviance_np(y, mu, power=p)

    sk_w = mean_tweedie_deviance(y, mu, power=p, sample_weight=w)
    np_w = mean_tweedie_deviance_np(y, mu, power=p, sample_weight=w)

    print(f'power={p:>4}: abs diff={abs(sk-npv):.3e}, abs diff (weighted)={abs(sk_w-np_w):.3e}')
power=   0: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power=   1: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power= 1.5: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power=   2: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power=   3: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power=  -1: abs diff=0.000e+00, abs diff (weighted)=0.000e+00

4) Intuition with plots#

Two important intuitions:

  1. Asymmetry: for \(p > 1\), severely underpredicting a positive target (\(\hat{\mu}\to 0^+\)) makes the deviance blow up, and the effect strengthens as \(p\) grows.

  2. Power \(p\) controls sensitivity to extreme errors: larger \(p\) damps the penalty for very large overpredictions (for \(p > 2\) the deviance even stays bounded as \(\hat{\mu}\to\infty\)).

We’ll visualize the deviance as a function of the multiplicative factor:

\[\hat{\mu} = y \cdot f\]

where \(f<1\) means underprediction and \(f>1\) means overprediction.
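Before plotting, two concrete numbers for the Gamma case (\(p=2\), \(y=10\), factors chosen for illustration): a 10x underprediction costs several times more than a 10x overprediction.

```python
import numpy as np

# Gamma unit deviance: d_2(y, mu) = 2 * (log(mu/y) + y/mu - 1)
def gamma_dev(y, mu):
    return 2 * (np.log(mu / y) + y / mu - 1)

y = 10.0
under = gamma_dev(y, y * 0.1)   # mu = 1
over = gamma_dev(y, y * 10.0)   # mu = 100
print(f'underpredict 10x: {under:.3f}   overpredict 10x: {over:.3f}')
```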

y0 = 10.0
f = np.logspace(-2, 2, 600)  # multiplicative factor
mu = y0 * f

powers = [0, 1, 1.5, 2, 3]

fig = go.Figure()
for p in powers:
    dev = tweedie_deviance_per_sample_np(np.full_like(mu, y0), mu, power=p)
    fig.add_trace(go.Scatter(x=f, y=np.log1p(dev), mode='lines', name=f'p={p}'))

fig.update_layout(
    title='Tweedie deviance vs multiplicative error (plotted as log(1 + deviance))',
    xaxis_title='factor f in μ̂ = y · f (log scale)',
    yaxis_title='log(1 + deviance)',
)
fig.update_xaxes(type='log')
fig.add_vline(x=1.0, line_dash='dash', line_color='rgba(0,0,0,0.4)')
fig.show()

Why underprediction can hurt a lot#

For \(p \ge 1\), the loss includes terms like \(\log\hat{\mu}\) (Poisson / Gamma special cases) or negative powers of \(\hat{\mu}\). This makes the deviance blow up when:

  • \(y>0\) and \(\hat{\mu}\to 0^+\) (severe underprediction)

In optimization, this shows up directly in the gradient.

mu_grid = np.logspace(-3, 3, 600)
y_fixed = 10.0

fig = go.Figure()
for p in [1, 1.5, 2, 3]:
    # For p!=0, the derivative wrt μ is: ∂d/∂μ = 2 * μ^{-p} * (μ - y)
    grad_mu = 2 * (mu_grid ** (-p)) * (mu_grid - y_fixed)
    fig.add_trace(go.Scatter(x=mu_grid, y=np.abs(grad_mu), mode='lines', name=f'|∂d/∂μ|, p={p}'))

fig.update_layout(
    title='Gradient magnitude vs prediction μ̂ (fixed y=10): small μ̂ can create huge gradients',
    xaxis_title='μ̂ (log scale)',
    yaxis_title='|∂d/∂μ| (log scale)',
)
fig.update_xaxes(type='log')
fig.update_yaxes(type='log')
fig.show()

5) Using Tweedie deviance to fit a model (low-level optimization)#

mean_tweedie_deviance is not only an evaluation metric: it is also a natural training loss for Tweedie GLMs.

We’ll fit a 1D log-link model:

\[ \eta_i = b_0 + b_1 x_i \qquad \hat{\mu}_i = \exp(\eta_i) \]

The log link guarantees \(\hat{\mu}_i>0\), which is required for \(p \ne 0\).

Objective#

\[ J(b_0,b_1) = \frac{1}{n}\sum_{i=1}^n d_p(y_i, \hat{\mu}_i) \]

Gradient (for p != 0)#

A useful identity (it holds for all admissible \(p\); at \(p=0\) it reduces to \(2(\hat{\mu}-y)\)) is:

\[ \frac{\partial d_p}{\partial \hat{\mu}} = 2\,\hat{\mu}^{-p}(\hat{\mu}-y) \]

With the log link \(\hat{\mu}=e^{\eta}\), we have \(\frac{\partial \hat{\mu}}{\partial \eta}=\hat{\mu}\), so:

\[ \frac{\partial d_p}{\partial \eta} = 2\,\hat{\mu}^{1-p}(\hat{\mu}-y) \]

Then:

\[ \frac{\partial J}{\partial b_0} = \frac{1}{n}\sum_{i=1}^n \frac{\partial d_p}{\partial \eta_i} \qquad \frac{\partial J}{\partial b_1} = \frac{1}{n}\sum_{i=1}^n x_i\,\frac{\partial d_p}{\partial \eta_i} \]
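A standalone finite-difference check of the chain-rule gradient \(\partial d_p/\partial\eta = 2\hat{\mu}^{1-p}(\hat{\mu}-y)\), at an arbitrary test point:

```python
import numpy as np

p, y, eta = 1.5, 3.0, 0.4

def dev(eta):
    mu = np.exp(eta)
    return 2 * (y**(2 - p) / ((1 - p) * (2 - p))
                - y * mu**(1 - p) / (1 - p)
                + mu**(2 - p) / (2 - p))

mu = np.exp(eta)
analytic = 2 * mu**(1 - p) * (mu - y)
numeric = (dev(eta + 1e-6) - dev(eta - 1e-6)) / 2e-6   # central difference
print(f'analytic={analytic:.6f}  finite-diff={numeric:.6f}')
```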

We’ll use gradient descent to minimize \(J\) for a synthetic dataset that looks like compound Poisson data (1 < p < 2).

# Synthetic compound Poisson–Gamma data (mass at 0 + positive continuous)
n = 500
x = rng.uniform(-2, 2, size=n)

b0_true = 0.2
b1_true = 0.8

# True mean curve
mu_true = np.exp(b0_true + b1_true * x)

# Compound Poisson–Gamma construction:
#   N ~ Poisson(lambda)
#   Severity ~ Gamma(k, theta)
#   Y = sum_{j=1..N} Severity_j
# Then Y has a point mass at 0 and a continuous positive tail.

severity_shape = 2.0
severity_scale = 0.5  # mean severity = shape * scale = 1.0
sev_mean = severity_shape * severity_scale

lam = mu_true / sev_mean
counts = rng.poisson(lam)

y = np.zeros(n)
mask = counts > 0
y[mask] = rng.gamma(shape=counts[mask] * severity_shape, scale=severity_scale)

fig = make_subplots(rows=1, cols=2, subplot_titles=('y vs x', 'y distribution'))
fig.add_trace(go.Scatter(x=x, y=y, mode='markers', marker=dict(size=5, opacity=0.6), name='y'), row=1, col=1)
fig.add_trace(go.Histogram(x=y, nbinsx=60, name='hist(y)'), row=1, col=2)
fig.update_xaxes(title_text='x', row=1, col=1)
fig.update_yaxes(title_text='y', row=1, col=1)
fig.update_xaxes(title_text='y', row=1, col=2)
fig.update_yaxes(title_text='count', row=1, col=2)
fig.update_layout(title='Synthetic compound Poisson-like regression target', showlegend=False)
fig.show()
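A standalone sanity check of the compound Poisson construction (fixed rate \(\lambda=2\) for simplicity; severity Gamma(2, 0.5) as above): the simulated mean should be \(\lambda\cdot\mathbb{E}[\text{severity}]\).

```python
import numpy as np

rng = np.random.default_rng(1)
lam, sev_shape, sev_scale = 2.0, 2.0, 0.5   # mean severity = 1.0

counts = rng.poisson(lam, size=200_000)
# Sum of N iid Gamma(k, theta) draws is Gamma(N*k, theta), as the generator above uses
y_sim = np.where(counts > 0,
                 rng.gamma(np.maximum(counts, 1) * sev_shape, sev_scale),
                 0.0)
print(f'E[Y] ~ {y_sim.mean():.3f} (expected {lam * sev_shape * sev_scale:.1f})')
```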
def exp_clip(z, clip=20.0):
    """exp(z) with clipping to avoid overflow during training."""

    return np.exp(np.clip(z, -clip, clip))


def tweedie_glm_loss_and_grad(b0, b1, x, y, *, power):
    """Return (loss, db0, db1) for μ̂ = exp(b0 + b1 x)."""

    eta = b0 + b1 * x
    mu = exp_clip(eta)

    loss = mean_tweedie_deviance_np(y, mu, power=power)

    p = float(power)
    if p == 0:
        raise ValueError('This helper assumes power != 0 (use MSE directly for p=0).')

    # ∂d/∂η = 2 * (μ^(2-p) - y * μ^(1-p))
    grad_eta = 2 * (mu ** (2 - p) - y * mu ** (1 - p))

    db0 = float(np.mean(grad_eta))
    db1 = float(np.mean(grad_eta * x))
    return loss, db0, db1


def fit_tweedie_glm_gd(x, y, *, power, lr=0.05, n_steps=600):
    b0, b1 = 0.0, 0.0

    history = {
        'step': [],
        'b0': [],
        'b1': [],
        'loss': [],
    }

    for step in range(n_steps):
        loss, db0, db1 = tweedie_glm_loss_and_grad(b0, b1, x, y, power=power)

        history['step'].append(step)
        history['b0'].append(b0)
        history['b1'].append(b1)
        history['loss'].append(loss)

        b0 -= lr * db0
        b1 -= lr * db1

    return (b0, b1), history
power = 1.5

(b0_hat, b1_hat), hist = fit_tweedie_glm_gd(x, y, power=power, lr=0.05, n_steps=700)

mu_hat = exp_clip(b0_hat + b1_hat * x)

print(f'true params: b0={b0_true:.3f}, b1={b1_true:.3f}')
print(f'GD   params: b0={b0_hat:.3f}, b1={b1_hat:.3f}, mean deviance={mean_tweedie_deviance_np(y, mu_hat, power=power):.4f}')

fig = go.Figure()
fig.add_trace(go.Scatter(x=hist['step'], y=hist['loss'], mode='lines', name='mean deviance'))
fig.update_layout(
    title='Gradient descent on mean Tweedie deviance',
    xaxis_title='step',
    yaxis_title='mean Tweedie deviance',
)
fig.show()

# Visualize fitted mean curve
x_line = np.linspace(x.min(), x.max(), 300)
mu_line_true = np.exp(b0_true + b1_true * x_line)
mu_line_hat = np.exp(b0_hat + b1_hat * x_line)

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode='markers', name='data', marker=dict(size=5, opacity=0.55)))
fig.add_trace(go.Scatter(x=x_line, y=mu_line_true, mode='lines', name='true E[y|x]', line=dict(dash='dash')))
fig.add_trace(go.Scatter(x=x_line, y=mu_line_hat, mode='lines', name='GD fitted mean'))
fig.update_layout(title='Fitting the mean with Tweedie deviance (log link)', xaxis_title='x', yaxis_title='y')
fig.show()
true params: b0=0.200, b1=0.800
GD   params: b0=0.192, b1=0.783, mean deviance=1.7361
# Visualize the loss landscape over (b0, b1) and the GD path
# (This is not a quadratic bowl like MSE, but is often nicely behaved for 1 <= p <= 2.)

b0_grid = np.linspace(b0_hat - 1.4, b0_hat + 1.4, 120)
b1_grid = np.linspace(b1_hat - 1.4, b1_hat + 1.4, 120)

B0, B1 = np.meshgrid(b0_grid, b1_grid)
ETA = B0[:, :, None] + B1[:, :, None] * x[None, None, :]
MU = exp_clip(ETA)

p = float(power)
# p=1.5 here (general formula)
const = np.power(y, 2 - p) / ((1 - p) * (2 - p))
DEV = 2 * (
    const[None, None, :]
    - y[None, None, :] * np.power(MU, 1 - p) / (1 - p)
    + np.power(MU, 2 - p) / (2 - p)
)
Z = np.mean(DEV, axis=2)

stride = max(1, len(hist['step']) // 140)
b0_path = np.array(hist['b0'])[::stride]
b1_path = np.array(hist['b1'])[::stride]

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=b0_grid,
        y=b1_grid,
        z=Z,
        contours_coloring='heatmap',
        colorbar=dict(title='mean deviance'),
    )
)
fig.add_trace(
    go.Scatter(
        x=b0_path,
        y=b1_path,
        mode='lines+markers',
        name='GD path',
        marker=dict(size=4, color='black'),
        line=dict(color='black'),
    )
)
fig.update_layout(
    title='Mean Tweedie deviance landscape + gradient descent trajectory',
    xaxis_title='b0 (intercept)',
    yaxis_title='b1 (slope)',
)
fig.show()

6) Practical usage (scikit-learn)#

As a metric:

from sklearn.metrics import mean_tweedie_deviance
mean_tweedie_deviance(y_true, y_pred, power=1.5)

As a model loss: scikit-learn provides TweedieRegressor, which minimizes a Tweedie deviance objective (optionally with L2 regularization via alpha).

Choosing power is usually treated as a hyperparameter (cross-validate over a small grid like [1.1, 1.3, 1.5, 1.7, 1.9]).
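A minimal sketch of such a grid search (names like candidate_powers and the toy data are ours, not from scikit-learn). One subtlety worth knowing: deviance values computed with different power are not on a common scale, so this sketch scores every candidate on held-out data with one fixed reference power.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 1))
y_cv = rng.gamma(2.0, np.exp(0.5 * X[:, 0]) / 2.0)   # strictly positive toy target

candidate_powers = [1.1, 1.3, 1.5, 1.7, 1.9]
scores = {}
for p_try in candidate_powers:
    folds = []
    for tr, te in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
        model = TweedieRegressor(power=p_try, link='log', alpha=0.0, max_iter=5000)
        model.fit(X[tr], y_cv[tr])
        # Fixed reference power keeps the scores comparable across candidates
        folds.append(mean_tweedie_deviance(y_cv[te], model.predict(X[te]), power=1.5))
    scores[p_try] = float(np.mean(folds))
best = min(scores, key=scores.get)
print('best power:', best)
```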

X = x.reshape(-1, 1)

reg = TweedieRegressor(power=power, link='log', alpha=0.0, max_iter=5000)
reg.fit(X, y)

mu_sk = reg.predict(X)

print('sklearn intercept_:', float(reg.intercept_))
print('sklearn coef_     :', reg.coef_)
print('mean deviance (sklearn fit):', mean_tweedie_deviance(y, mu_sk, power=power))
print('mean deviance (our GD fit) :', mean_tweedie_deviance(y, exp_clip(b0_hat + b1_hat * x), power=power))
sklearn intercept_: 0.19246904943966447
sklearn coef_     : [0.7827]
mean deviance (sklearn fit): 1.7361175245303142
mean deviance (our GD fit) : 1.7361175244957865

7) Pros, cons, and when to use it#

Pros#

  • Principled: corresponds to (scaled) negative log-likelihood for Tweedie GLMs

  • Handles zeros + positive continuous when 1 < p < 2 (compound Poisson)

  • Tunable via p: bridges MSE (p=0), Poisson deviance (p=1), Gamma deviance (p=2)

  • Naturally heteroscedastic: matches data where variance grows with the mean

Cons#

  • Requires selecting a good power parameter p (mis-specification hurts)

  • Domain constraints: y_pred must be strictly positive for p != 0 (and y_true must be positive for p >= 2)

  • Less “unit-interpretable” than MAE/MSE (involves powers/logs)

  • Can be numerically harsh if a model predicts values near 0 while targets are positive

Good use cases#

  • p=1 Poisson: count data (calls per hour, clicks per day, defects per batch)

  • 1 < p < 2 compound Poisson: claim cost, rainfall, “amount spent” with many zeros

  • p=2 Gamma: positive continuous targets with multiplicative noise (severity, durations, rates)

8) Common pitfalls + diagnostics#

  • Check your target support: if you have zeros, avoid p >= 2.

  • Ensure positivity of predictions for p != 0 (log link or clipping with a small epsilon).

  • Watch for huge contributions: plot per-sample deviances to spot underprediction near \(\hat{\mu}\approx 0\).

  • Choose p deliberately: cross-validate; 1 < p < 2 is often a strong starting point for zero-heavy positive targets.
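Following the third bullet, a small standalone example of a per-sample deviance diagnostic (Poisson case, made-up numbers): the worst offender is a positive target paired with a near-zero prediction.

```python
import numpy as np

# Per-sample Poisson deviance with the 0*log(0) = 0 convention
def poisson_dev(y, mu):
    with np.errstate(divide='ignore', invalid='ignore'):
        log_term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2 * (log_term - y + mu)

y  = np.array([0.0, 1.0, 5.0, 8.0])
mu = np.array([0.2, 1.1, 4.0, 0.05])   # last pair: severe underprediction
d = poisson_dev(y, mu)
print(d.round(3), '-> dominant sample:', int(np.argmax(d)))
```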

9) Exercises#

  1. Implement d2_tweedie_score and show how it relates to the mean deviance.

  2. On the synthetic dataset above, try p in {1.1, 1.3, 1.5, 1.7, 1.9} and compare fitted curves and final loss.

  3. Modify the data generator to increase the fraction of zeros. Does the best p shift?

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_tweedie_deviance.html

  • scikit-learn user guide (Tweedie regression): https://scikit-learn.org/stable/modules/linear_model.html#tweedie-regression

  • Jørgensen, B. (1997). The Theory of Dispersion Models.